Abstract: The study of deception and the theories that have been
developed have relied heavily on laboratory experiments in
controlled environments, using American college students
participating in mock scenarios. The goal of this study was to
validate previous deception research in a real-world, high-stakes
environment. This study used previously confirmed speech cues to
deception and their constructs in an attempt to validate a leading
deception theory, Interpersonal Deception Theory (IDT). The results
did validate IDT, with mixed results on individual measures and
their constructs.
I. Introduction

Deception is a ubiquitous form of communication [1]. In fact,
deception is a major characteristic of the most common
communication channels: 14% of people self-reported deceiving in
emails, 37% in phone calls, and 27% in face-to-face interactions
[2]. The formal study of deception detection and its cues has been
covered in numerous cross-discipline studies, and the consensus is
that humans are poor detectors of deceit [3], [4], [5]. In perhaps
the most comprehensive meta-analysis of deception detection cues
and their accuracy, Bond and DePaulo [3] looked at 206 studies with
24,483 judgments and found a mean accuracy of 53.4%. Put
colloquially, humans might as well flip a coin when it comes to
detecting deception. Moreover, humans are not just inaccurate
detectors of deceit but also poor judges of which cues indicate
deception, and they are often affected by multiple biases [6].
Human bias toward unreliable deceptive cues hampers our ability to
perceive deception and can decrease accuracy below chance [6].
Humans have therefore long searched for behaviors and tools to aid
them in detecting deceit.

Current methods to detect deception all have drawbacks and can be
split into two categories, invasive and non-invasive. Of the
invasive technologies currently available to help identify and
measure deceit, the polygraph is the best known. The polygraph is a
device that takes various cardiac, skin conductivity, and
respiratory measures to detect deception. It is based on the idea
that these physiological measures are directly linked to the
conditions brought on by deception attempts [6]. In a summary of
laboratory tests, Vrij reports that the polygraph is about 82%
accurate at identifying deceivers [6]. However, polygraph exams
have several strong limitations, namely the need for a willing
subject, an invasive exam, and a trained examiner. The polygraph
exam itself can evoke fear and apprehension in its subjects, making
it a controversial investigative tool.

The newest invasive method to detect deception uses functional
magnetic resonance imaging (fMRI) to map blood flow in the brain
during structured questioning. One fMRI study reached a deception
detection accuracy of 100% when subjects did not employ
countermeasures [7]. fMRI measures the hemodynamic response, or
changes in blood flow, related to brain activity. Researchers have
noticed differences between the brain activity of truth-tellers and
deceivers [8], [9], [10]. Though initial findings are promising,
fMRI shares the same restrictions as the polygraph (a willing
subject, an invasive exam, and the need for a trained examiner).
Additional limitations on the general use of fMRI scanners are
their sheer size, their cost to operate, the fact that subjects
cannot move at all, and the fact that they cannot be used on people
with claustrophobia or metallic implants. These leading deception
detection tools are prohibitive [11] and emphasize the need for
less obtrusive means of measuring deceptive behavior that do not
require human intervention.

In addition to prohibitive tools, the methodology used in most
deception detection research is lacking in ways that separate it
from real-world settings [12]. A vast majority of current deception
detection research uses American university students instructed to
lie in mock scenarios [12], [13]. Research in high-stakes
environments, such as interviews during a criminal investigation,
is deficient [14], [12], [15], [16], [17], [18]. This has driven a
strong need for more field studies in deception detection research
[19].

A principal deception detection meta-analysis of 120 studies showed
that 101 used student subjects [12]. Only four of these studies
(3%) involved situations where the subjects were not given
instructions to lie but chose to do so on their own. There is
evidence that behavior differs between those who choose to lie and
those directed to lie by an experimenter [20]. For example, those
who chose to lie, compared to those instructed to lie, made fewer
speech errors and hesitations and fewer references to others.
Therefore, studies using real-world samples of subjects who chose
whether or not to be deceptive may contribute more deeply to the
understanding of deception than studies using mock lie scenarios,
as well as provide more generalizable findings. Another criticism
of mock lies concerns the lack of motivation: participants have
little to lose and do not choose to lie, and hence have little or
no vested interest in whether or not they get caught [21]. A lack
of personal involvement in the lie is another critique of
laboratory studies [22].

Research has also shown that the duration and content of a lie can
influence how successful a person can be at deception. Longer lies,
for instance, are more difficult to tell than short ones [23]. The
idea that longer lies are more complex and difficult to maintain
than short and simple lies is practically common sense. In their
meta-analysis, DePaulo et al. [12] predicted that if deceivers were
required to sustain their deception for greater lengths of time,
then cues to deception would be clearer and more numerous. Their
findings supported their hypothesis; duration did moderate the size
of the effect. However, real-world high-stakes (RWHS) situations,
like law enforcement interviews, may be longer than subjects will
tolerate in a controlled experiment (e.g., the RWHS interview in
this study lasted 14 hours over three days).

Although RWHS deception detection research could address these
issues, it must overcome the wicked problem of establishing ground
truth [24]. Ground truth is a verified or indisputable fact, for
example one that adheres to the evidentiary guidelines used in a
court of law. In a laboratory setting, establishing ground truth is
a matter of experimental design, fully controlled by the
researcher. This same control is not possible in the real world,
and attempting to subject people to real stressors that would lead
to deceptive communication would be unethical and most likely
illegal (e.g., asking a student to steal a computer from the
school's lab and then monitoring them during police interviews). In
addition, random assignment of participants to treatment groups is
not possible in field studies. Because of the wicked problem of
establishing ground truth in RWHS deception and the ethical
barriers to laboratory experiments, case studies based on field
data seem to be the research design with the greatest chance to
further the understanding of deception detection.

Another issue with the body of deception detection research is that
current theories lack adequate validation in RWHS settings [6],
[25]. One promising theory that can benefit from validation in a
RWHS setting is Interpersonal Deception Theory (IDT) [26]
(Figure 1). According to IDT, the counterpart to senders' deception
is receivers' suspicion. IDT suggests that deception is a dyadic
interaction and that, as the deception takes place, receivers may
become suspicious of the sender's attempts to deceive and may adapt
their behavior because of it.

For  example,  they  may  choose  to  conceal  their  suspicion  by 
quickly  moving  on  to  another  topic  or  admit  their  suspicion 
and  confront  the  sender  to  gauge  their  reaction.  While  IDT 
already  has  empirical  support  [27],  [28],  examining  it  under  a 
RWHS  lens  can  strengthen  its  validation. 

This  study  is  an  exploration  of  real-world,  high-stakes 
(RWHS)  deceptive  behavior  manifested  in  human  speech,  and 
analyzed  by  objective  measures. 


Figure 1. Interactive Deception Model (Adapted)

It is not a laboratory experiment with controlled settings in a
closed environment; rather, it is more akin to a field study.
Though several statistical tools were employed and sound
methodology was followed at every opportunity, their use was not to
prove or disprove hypotheses but to explore the data and examine
propositions based on theory, namely IDT. The impetus for this
study came after a lengthy literature review on deception detection
and has three tenets: (1) existing theories of deception need
validation (2) outside the lab in a RWHS setting, (3) where typical
dyadic interactions are longer and more complex than those studied
in a controlled setting. These tenets are the research gaps
identified and where it is believed the most stands to be gained by
exploration. In exploring them, the study serves to validate the
phenomenon posited by IDT.

II.  Methodology 

Primarily, the study seeks to answer the question: Are speech cues
to deceptive behavior moderated over time by receiver suspicion
during dyadic interactions in a real-world high-stakes setting? To
address this research question, we first identified a communication
channel that was easily measured with automated tools but also rich
in behavioral cues. Speech is such a channel. Deception researchers
have long been interested in speech as a source of behavioral cues
[29], [30], [31], [32], [33]. Once the communication channel was
identified, we chose a list of cues and constructs that previous
research had reported to be good indicators of deception.

Speech can be split into two categories, linguistic and
paralinguistic. Linguistics is the study of what someone says, and
paralinguistics is the study of how they say it. Our initial
measures contained linguistic-based constructs from Fuller, Biros,
and Wilson [34]. Fuller et al.'s study looked at 370 written
suspect statements given during law enforcement interviews
following criminal cases. Fuller et al.'s constructs and
measurements were chosen because they generated almost 74% accuracy
in deception detection, the data was RWHS field data taken in law
enforcement environments with solid ground truth validation, and
the units of measure were written statements. This matches the
current data set, with the exceptions that it is a transcript of a
law enforcement interview and the unit of measure varies from
individual words to multiple sessions. The seven constructs used in
this study are listed in Table 1.


Table 1. Linguistic Constructs & Measures

| Construct | Construct Measurement | Brief Description |
|---|---|---|
| Quantity | # of Words, Verbs, & Sentences | Length of message |
| Specificity | Sensory ratio, Spatial ratio, Temporal ratio, Content Word Diversity, Bilogarithmic Type-Token Ratio | Amount and type of details in the message |
| Uncertainty | Certainty Terms, Tentative Terms, Modal Verbs, Passive Voice, Generalizing Terms | Relevance, directness, and certainty of message |
| Clarity | Redundancy, Sentence Length, Complexity Ratio, Average Word Length, Causation Terms | Message clarity and comprehensibility |
| Immediacy | 1st person pronouns, 2nd person pronouns, 3rd person pronouns | Attempts to disassociate oneself from the events described |
| Affect | Activation, Imagery, Pleasantness* | Emotions present in the message |
| Cognitive Processing | Exclusive Verbs, Motion Words, Cognitive Processing Terms | Increased or decreased cognitive processing and cognitive information present in the message related to veracity |

* Note: Fuller [34] used positive and negative measures for each
Affect measure; this study combines the positive and negative into
a single bipolar measure for ease of processing.

In addition to the seven linguistic constructs from Fuller listed
above, an eighth construct, Severity, was also considered important
in that work. However, it is not part of the current study because
its measure would be constant across the current data set. The
current data comes from interviews with a serial rapist whose
crimes carried a punishment of life in prison. The lead detective
in this case would assign the maximum severity score of five on the
one-to-five scale used by Fuller.
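
As an illustration of how measures such as those in Table 1 can be
computed from a transcribed response, the following Python sketch
counts words and sentences (two of the Quantity measures) and
computes a plain type-token ratio together with its bilogarithmic
form, log(types) / log(tokens). It is only a sketch and not the
tooling used in this study: the function name, the
regular-expression tokenizer, and the sentence splitter are
assumptions made for illustration, no content-word filtering is
applied, and verb counts are omitted because they would require a
part-of-speech tagger.

```python
import math
import re


def quantity_and_diversity(response: str) -> dict:
    """Compute simple length and diversity measures for one response.

    Hypothetical helper for illustration; tokenization is deliberately naive.
    """
    # Words: runs of letters or apostrophes, case-folded.
    words = re.findall(r"[a-z']+", response.lower())
    # Sentences: split on terminal punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]

    n_tokens = len(words)
    n_types = len(set(words))

    # Plain type-token ratio; undefined for an empty response.
    ttr = n_types / n_tokens if n_tokens else None
    # Bilogarithmic TTR: log(types) / log(tokens); needs at least two tokens.
    bilog_ttr = math.log(n_types) / math.log(n_tokens) if n_tokens > 1 else None

    return {
        "num_words": n_tokens,
        "num_sentences": len(sentences),
        "type_token_ratio": ttr,
        "bilog_ttr": bilog_ttr,
    }


if __name__ == "__main__":
    print(quantity_and_diversity("I was not there. I was at home that night."))
```

The logarithmic form is less sensitive to response length than the
plain ratio, which matters here because responses vary widely in
length.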

For the paralinguistic measures we looked at the vocal constructs
examined by Meservy [35]. These constructs and their measures were
selected for this study because they represent a thorough coverage
of the audio channel and tools exist to measure each. The six
constructs were Fluency, Duration, Tempo, Intensity, Frequency, and
Voice Quality [29], [12]. However, because the construct Voice
Quality contains cues that are difficult to measure objectively
without human evaluation, this construct was removed; a focus of
this study is on identifying behavioral cues that can be
objectively measured and potentially automated. The five remaining
constructs and their 14 measures are described in Table 2.


Table 2. Paralinguistic Constructs & Measures

| Construct | Construct Measurement | Brief Description |
|---|---|---|
| Fluency | 1. Non-ah disturbances; 2. Speech errors; 3. Silent pauses; 4. Filled pauses | 1. Speech disturbances other than "um", "er", "ah", and other such words; 2. General speech errors; 3, 4. Various pauses in conversation |
| Duration | 1. Length of interaction; 2. Response length; 3. Talking time | 1. Total time of dyadic interaction; 2. Length of sender's response; 3. Proportion of total time sender talks |
| Tempo | 1. Rate of speaking; 2. Rate change | 1. Average number of words per minute; 2. Rate of speaking in the epoch minus the average rate of speaking for all responses |
| Intensity | 1. Amplitude; 2. Amplitude variety | 1. Loudness of sender's voice; 2. Variation of loudness of a sender's voice |
| Frequency | 1. Pitch; 2. Pitch change; 3. Pitch variety | 1. The average fundamental frequency of sender's voice; 2. Variation of pitch of a sender's voice; 3. Frequency of changes of pitch of a sender's voice |

* Note: the measures Interruptions (from the construct Fluency) and
Response Latency (from the construct Duration) are not considered
because of the difficulty of automating them. Interruptions were
removed from the current study because splitting speakers in a
single-channel audio recording is extremely difficult [36].
However, methods do exist for speaker-based segmentation, which
could be explored in future research [37].
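
The two Tempo measures in Table 2 are simple enough to sketch
directly. The Python fragment below computes the rate of speaking
(words per minute) for each response and the rate change, defined
in Table 2 as the rate in the epoch minus the average rate across
all responses. The function names and the input format (a word
count and a duration in seconds for each response) are assumptions
made for illustration; the study does not describe its own
implementation at this level of detail.

```python
def words_per_minute(num_words: int, duration_seconds: float) -> float:
    """Rate of speaking for one response, in words per minute."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return num_words * 60.0 / duration_seconds


def tempo_measures(responses):
    """Return (rate, rate_change) for each response.

    `responses` is a list of (num_words, duration_seconds) pairs
    (a hypothetical input format). Rate change is the response's rate
    minus the mean rate over all responses.
    """
    if not responses:
        return []
    rates = [words_per_minute(words, secs) for words, secs in responses]
    mean_rate = sum(rates) / len(rates)
    return [(rate, rate - mean_rate) for rate in rates]


if __name__ == "__main__":
    # Two illustrative responses: 30 words in 15 s and 60 words in 20 s.
    for rate, change in tempo_measures([(30, 15.0), (60, 20.0)]):
        print(f"rate={rate:.1f} wpm, change={change:+.1f} wpm")
```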

Once a list of deceptive behavioral cues was identified, we had to
find a situation that met all the requirements of a RWHS dyadic
interaction. There had to be a dyadic interaction between a sender
and receiver whereby the sender might adjust his deceptive behavior
when the receiver became suspicious of the sender's message.
Fortunately, there is just such a RWHS setting that meets those
criteria: the interview between an investigating police officer and
a suspect in a criminal case. What follows is a description of the
case.

III. Case Description

Please note that this case has been adjudicated and all
identifiable information is publicly available upon proper request.
In November 2004, James Perry was sentenced in federal court in
Madison, Wisconsin, to 470 years in prison for creating child
pornography, rape, sexual exploitation of children, child sexual
assault, and kidnapping, a crime spree that spanned over five years
and four states. It is the longest sentence for sex crimes in
Wisconsin history, and there is no option for parole.

In 2004 James Perry committed his final assault, which led to his
capture. Perry, a husband and father of two young girls, entered a
Madison, Wisconsin hotel with the intent of committing a sexual
assault against a 13-year-old girl. This incident was one of only
two times Perry was ever caught on film despite targeting very
public locations. It was a key piece of evidence linking him to a
long series of rapes and assaults. At the same time, the FBI was
investigating a child pornography ring in which Perry was involved.
Only a few days after the assault and attempted abduction of the
young girl, the FBI arrested Perry for his involvement in the
internet child pornography ring.

The lead detective from the Madison Police Department (PD) became
aware that the serial rapist she had been hunting was in FBI
custody. The FBI was not aware of any rape or assault charges at
that time. The lead detective informed the FBI about the plethora
of crimes Perry had committed, and all plea bargaining on federal
charges stopped so the lead detective could conduct the interview.
The Madison PD had a list of 45 victims but believed there were
hundreds more. Perry was highly motivated to lie because, before
the interview, he was trying to proffer a plea agreement with the
FBI for only a few years in prison in exchange for testifying
against others in the child pornography ring. If additional charges
for rape, sexual exploitation of children, child sexual assault,
and kidnapping were added, all plea bargaining would stop and he
would be facing life in prison, an environment particularly
unfriendly to child molesters. Only after the interview, when Perry
became aware of all the evidence against him, was a plea agreement
made to stop adding charges (over and above the 125 he was now
being charged with), because he was now most definitely going to
prison for life.

Law enforcement videotaped three consecutive days of interviews
totaling 14 hours and 27 minutes. The interviews were conducted by
the same lead detective and her partner in the same room and under
the same conditions, with Mr. Perry and his attorney. Interaction
was primarily between the lead detective and Mr. Perry; only minor
contributions (less than five minutes total) were made by the
second detective and Mr. Perry's attorney, and their voices were
removed before analysis. A 200-page law enforcement transcript was
generated by the lead detective immediately after the interviews.
The law enforcement transcript contains all questions asked and the
responses, often in quotations, with additional pertinent notes by
the lead detective. Both the videotaped interviews and the law
enforcement transcript were used in federal court. On the first
day, the interview lasted just over four hours and 10 minutes,
during which 711 individual questions were asked covering 209
different topics. The quality of the audio was very poor: a single
microphone in a noisy room, which is typical for this setting.

Ground truth was established based on credible evidence admissible
in a federal court. The lead detective identified four types of
statements: (1) the truth, (2) suspected lies without evidence,
(3) suspected lies with evidence, and (4) confirmed lies. Confirmed
lies were those statements proven to be false by indisputable
evidence admissible in court. When the sender made these
statements, law enforcement personnel knew for a fact he was lying.
Suspected lies with evidence were those statements on which law
enforcement personnel had disputing evidence; however, for various
reasons that evidence was not or could not be admitted into federal
court. Suspected lies without evidence were those statements law
enforcement personnel believed, in their expert opinion, to be
false but for which they had little or no evidence. The final type,
truthful statements, were those that law enforcement personnel knew
to be true or had no reason to believe were false.

Given a set of behavioral speech cues and their constructs, a clear
definition of ground truth, and a suitable RWHS data set, the next
step was to run the analysis to determine if these cues were
moderated over time by the receiver's suspicion.

IV. Analysis

The data preparation process followed the steps shown in Figure 2.
First, the raw video stored on DVD was processed with Adobe
Soundbooth to isolate the audio from the video portion; there was
no loss of audio data during this step. The digital audio files
were then passed through DC Live Forensic 7.5 to improve audibility
in preparation for segmentation. Global filters were applied to
remove audio signals outside the range that humans can hear or
produce. Any filters or transformations used to improve audibility
were applied universally, and all recording took place in the same
room with the same recording device and the same environmental
settings. Once the global filters had removed noise outside the
human speech range and audibility had been improved, the audio was
segmented into question/response pairs and grouped by topic.

The audio was then duplicated for split processing for the two
categories of cues, linguistic and paralinguistic. In preparing the
audio for linguistic transcription, any audio or acoustic filter
that improves transcription accuracy can be applied (i.e., pitch,
tone, cadence, etc. have no impact on linguistic cues). Linguistic
cues were measured from the transcript using Structured Programming
for Linguistic Cue Extraction (SPLICE) and Linguistic Inquiry and
Word Count (LIWC) software. Waikato Environment for Knowledge
Analysis (WEKA) (Witten & Frank, 2000) was used for classification
based on the initial text processing steps. This transcript was
then compared to the law enforcement transcript, and deceptive
statements were coded into the full transcript.


Figure 2. Data Processing Steps

The goal of processing the data for paralinguistic cue measurement
is the removal of noise without removing, degrading, or changing
the speech signal. There are several techniques for removing noise
and improving the clarity of audio; however, some can be very
aggressive and rely on human physical and cognitive audio
processing characteristics to "trick" the listener into hearing
clearer voices. This study took a conservative approach to audio
filter selection to retain as much of the voice signal as possible.
DC Live Forensic 7.5 was the primary audio tool used for processing
the paralinguistic measures.

The data set comprises 41 measures across 12 deception detection
constructs. The linguistic-based cue constructs are Quantity,
Specificity, Uncertainty, Clarity, Immediacy, Affect, and Cognitive
Processing. The paralinguistic-based cue constructs are Tempo,
Intensity, Frequency, Fluency, and Duration (Table 3). For this
paper we pay particular attention to the linguistic construct
Quantity and its three measures: the number of words, verbs, and
sentences.

Examining construct means required converting the individual
measures to z-scores and averaging them for each response. The raw
score means that follow give a good initial understanding of the
spread of the data. For example, # of Words averaged just over 43
overall; truthful statements averaged less, at about 39 words, and
deceitful statements averaged much more, at about 66 words
(Table 4).
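
The standardize-then-average step can be sketched as follows. This
Python fragment converts each measure to z-scores across responses
and then averages the z-scores of the measures belonging to one
construct, giving one construct score per response. It is a minimal
sketch under stated assumptions: the function names are
hypothetical, the population standard deviation is used (the paper
does not say whether the population or sample form was applied),
and the example numbers are invented.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one measure across responses (population std. dev.)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant measure carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def construct_scores(measures):
    """Average the z-scored measures of one construct, per response.

    `measures` maps a measure name to its list of raw values, one value
    per response; all lists must have the same length.
    """
    standardized = [z_scores(column) for column in measures.values()]
    return [mean(per_response) for per_response in zip(*standardized)]


if __name__ == "__main__":
    # Invented raw values for the three Quantity measures on three responses.
    quantity = {
        "num_words": [10, 20, 30],
        "num_verbs": [1, 2, 3],
        "num_sentences": [2, 4, 6],
    }
    print(construct_scores(quantity))
```

Standardizing first keeps a measure with a large raw range, such as
the word count, from dominating the construct average.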


Table 3. Descriptive Means of Constructs

| Construct | Truthful mean (z-score) | Deceitful mean (z-score) |
|---|---|---|
| Quantity | -0.096 ↓ | 0.483 ↑ |
| Quantity | 0.068 ↑ | -0.312 ↓ |
| Specificity | -0.018 ↓ | 0.096 ↑ |
| Uncertainty | -0.034 ↓ | 0.181 ↑ |
| Clarity | -0.028 ↓ | 0.151 ↑ |
| Immediacy | -0.021 ↓ | 0.138 ↑ |
| Affect | -0.049 ↓ | 0.266 ↑ |
| Cognitive Proc. | -0.002 ↓ | 0.013 ↑ |
| Fluency | -0.077 ↓ | 0.398 ↑ |
| Duration | 0.022 ↑ | -0.081 ↓ |
| Tempo | -0.013 ↓ | 0.048 ↑ |
| Intensity | 0.013 ↑ | -0.102 ↓ |

Overall, the mean scores for all but three constructs (75%)
increased during deceitful behavior, while truthful behavior showed
a decrease in construct z-scores. Because all of the constructs are
reflective (vs. formative), it follows that changes in the
individual cues reflect the changes in the latent constructs, as
seen in the following table [38], [39].


Table 4. Descriptive Means of Measures

| Cue | Mean (raw) | Truth | Lie |
|---|---|---|---|
| # of Words | 43.72 | 39.25 ↓ | 66.42 ↑ |
| # of Verbs | 3.33 | 3.00 ↓ | 5.03 ↑ |
| # of Sentences | 8.31 | 7.41 ↓ | 12.82 ↑ |
| Sensory ratio | 0.79 | 0.81 ↑ | 0.65 ↓ |
| Temporal ratio | 4.77 | 4.71 ↓ | 5.15 ↑ |
| Content Diversity | 0.80 | 0.81 ↑ | 0.73 ↓ |
| TT-Ratio | 79.37 | 80.74 ↑ | 73.08 ↓ |
| Certainty Terms | 3.022 | 3.03 ↑ | 2.91 ↓ |
| Tentative Terms | 3.11 | 3.07 ↓ | 3.45 ↑ |
| Modal Verbs | 10.49 | 10.24 ↓ | 11.82 ↑ |
| Passive Voice | 0.01 | 0.00 ↑ | 0.00 ↓ |
| Gen. Terms | 2.33 | 2.27 ↓ | 2.44 ↑ |
| Redundancy | 18.92 | 18.66 ↓ | 20.32 ↑ |
| Sentence Length | 12.66 | 12.53 ↓ | 13.49 ↑ |
| Complexity Ratio | 2.51 | 2.51 ↓ | 2.51 ↑ |
| Avg Word len. | 3.82 | 3.82 ↑ | 3.78 ↓ |
| Causation Terms | 1.03 | 0.85 ↓ | 1.99 ↑ |
| 1st p. pronouns | 9.63 | 9.36 ↓ | 11.29 ↑ |
| 2nd p. pronouns | 0.74 | 0.71 ↓ | 0.77 ↑ |
| 3rd p. pronouns | 3.03 | 3.02 ↓ | 3.11 ↑ |
| Activation | 1.59 | 1.58 ↓ | 1.66 ↑ |
| Imagery | 1.40 | 1.39 ↓ | 1.44 ↑ |
| Pleasantness | 1.73 | 1.72 ↓ | 1.78 ↑ |
| Exclusive Verbs | 3.13 | 2.94 ↓ | 4.09 ↑ |
| Motion Words | 2.10 | 2.04 ↓ | 2.50 ↑ |
| Cog. Proc. Terms | 16.82 | 16.39 ↓ | 19.25 ↑ |
| Non-ah distur. | 2.30 | 2.19 ↓ | 3.05 ↑ |
| Speech errors | 0.010 | 0.010 ↑ | 0.01 ↓ |
| Silent pauses | 0.103 | 0.09 ↓ | 0.11 ↑ |
| Filled pauses | 2.010 | 2.05 ↑ | 1.89 ↓ |
| Interaction len. | 20.94 | 19.91 ↓ | 26.33 ↑ |
| Response len. | 13.00 | 11.78 ↓ | 19.27 ↑ |
| Talking time | 13.00 | 11.78 ↓ | 19.27 ↑ |
| Rate of speaking | 4.94 | 4.91 ↓ | 5.06 ↑ |
| Rate change | 0.65 | 0.69 ↑ | 0.51 ↓ |
| Amplitude | 53.72 | 53.68 | 53.88 ↑ |
| Amp. variety | 0.014 | 0.01 | 0.014 ↑ |
| Pitch | 135.11 | 136.5 | 125.6 ↓ |
| Pitch change | 0.053 | 0.05 | 0.051 ↓ |
| Pitch variety | 49.55 | 49.26 | 51.05 ↑ |

Overall, 70.7% of the measures showed increases during deceptive
responses, reflecting a general rise in behavior measures
(Table 4). This could be explained by the deceiver's tendency to
overcompensate because he is anxious to appear honest [34]. To
better understand the differences between the group means, ANOVA
was run on all constructs and measures.


poor  quality  of  most  RWHS  audio  recordings,  paralinguistic 
measures  may  be  the  measures  of  choice  in  those 
environments. 


An  initial  step  to  reporting  ANOVA  results  should  be  to 
define  what  is  "extreme”.  In  other  words,  what  is  the  cutoff 
value  of  a  level  of  significance  given  the  nature  of  the  study. 
Most  linguistic  and  psychol ingui Stic  as  well  as  MIS  journals 
enforce  the  conventional  a  of  0.05  [40],  Because  of  the 
exploratory  nature  of  this  study  Type  II  errors  (failing  to  reject 
when  the  null  hypothesis  is  in  fact  false)  are  more  acceptable 
than  Type  !  (rejecting  the  null  hypothesis  when  in  fact  it  is 
true),  in  practical  terms,  believing  a  treatment  has  an  effect 
when  in  fact  there  is  none  (Type  11  error)  is  less  damaging  than 
dismissing  a  treatment  that  in  fact  has  an  effect  (Type  I  error) 
[40],  Furthermore,  given  the  uncontrolled  environment  from 
which  the  data  was  collected,  a  more  relaxed  a  of  0.10  is 
adopted.  The  ANOVA  of  the  41  behavioral  cues  measured 
29.3%  as  significant  at  the  Q/R  pair  epoch  level.  Given  the 
poor  quality  of  the  audio  data,  this  is  strong  support  for 
utilizing  these  measures  in  future  deception  detection  research. 
What  follows  in  Tables  5  and  6  are  the  ANOVA  statistics  on 
the  constructs  and  individual  measures  z-score  data. 


Table  6,  ANOVA 

by  Granularity 

Topic 

MEASURES 

Topic 

M EASU RES 

F 

Sig 

F 

Sig 

NumSeMenees 

2.14 

,000  1 

Silent  Pauses 

237 

.000 

ContentWordDiv, 

1.38 

,087 

AmpMeandB 

2.40 

,000 

Complexity  Ratio 

1.72 

OKI 

Amp  Variety  Pascals 

2.27 

.000 

AvgWord  Length 

1.97 

.002 

PilchChange 

1.64 

.01  s 

CausationTerms 

1.68 

.013 

1  stppronovm 

!  37 

.090 

2ndp  pronoun 

1.51 

.040 

3 rdp pronoun 

2.41 

.000 

Motion  Words 

1  67 

.014 

In order to run the ANOVA on the constructs, the data required
manipulation so that the aggregate of the different measures could
be computed. Each measure was converted to a z-score on a positive
scale, which allowed a meaningful average to be computed for each
construct at each level of granularity.
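
The aggregation described above can be sketched as follows. This is a minimal illustration and not the study's actual procedure: the measure names are hypothetical, the population standard deviation is an assumption, and "on a positive scale" is interpreted here, as an assumption, to mean shifting each measure's z-scores so that its minimum is zero.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one measure: subtract the mean, divide by the std. dev."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def construct_scores(measures):
    """Average the standardized measures of one construct, unit by unit.

    `measures` maps a measure name to its raw values, one value per unit
    of analysis (for example, one per Topic).
    """
    standardized = [z_scores(vals) for vals in measures.values()]
    # Assumption: "positive scale" = shift so each measure's minimum is 0.
    shifted = [[z - min(zs) for z in zs] for zs in standardized]
    # zip(*...) walks the measures in parallel, one unit at a time.
    return [mean(unit) for unit in zip(*shifted)]


# Hypothetical raw values for two measures of one construct, three units.
example = {"word_count": [1, 2, 3], "sentence_count": [10, 20, 30]}
scores = construct_scores(example)
```

Standardizing first puts measures with very different units on a common scale, which is what makes the per-construct average meaningful.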


Table 5, Construct ANOVA

CONSTRUCTS              By Topic
                        F(3, 707)   Sig.
Quantity                1.527       .037
Specificity             1.440       .062
Uncertainty             0.621       .945
Clarity                 1.208       .207
Immediacy               1.858       .004
Affect                  1.222       .194
Cognitive Processing    1.478       .049
Fluency                 1.494       .045
Time Duration           1.131       .289
Time Tempo              1.690       .013
Intensity               2.372       .000
Frequency               1.474       .050
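
As a quick check of how the relaxed α of 0.10 applies to the construct-level results, the significance values reported in Table 5 can be compared against the threshold. The snippet below is only an illustrative sketch; the variable names are not from the study.

```python
ALPHA = 0.10  # relaxed significance level adopted in this study

# Significance values as reported in Table 5 (Topic level).
construct_sig = {
    "Quantity": 0.037,
    "Specificity": 0.062,
    "Uncertainty": 0.945,
    "Clarity": 0.207,
    "Immediacy": 0.004,
    "Affect": 0.194,
    "Cognitive Processing": 0.049,
    "Fluency": 0.045,
    "Time Duration": 0.289,
    "Time Tempo": 0.013,
    "Intensity": 0.000,
    "Frequency": 0.050,
}

# Keep the constructs whose reported p-value falls below the threshold.
significant = [name for name, p in construct_sig.items() if p < ALPHA]
share = len(significant) / len(construct_sig)
```

Under the conventional α of 0.05, Specificity (.062) would drop out, and Frequency (.050) would sit exactly at the boundary.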

ANOVA results on the constructs were strong. There was a
significant effect of Suspicion on 8 of 12 constructs (p ≤ .062);
across all 12 constructs, 0.621 ≤ F(3, 707) ≤ 2.372. The strong
performance of the paralinguistic constructs is encouraging if we
consider the goal of automating the capture and processing of
speech for deception measurement. All linguistic measures beyond
syllable counting require a speech recognition engine [41]. Given the


As seen in Table 6, there was a significant effect of Suspicion
on 13 of 41 measures at the Topic level of granularity
(1.37 ≤ F(3, 707) ≤ 2.41, p ≤ .090).
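
For readers less familiar with the F statistic reported here, the following is a minimal sketch of a one-way ANOVA computation, assuming the levels of Suspicion are treated as independent groups. The function name and the sample data are hypothetical and are not the study's data. Under this assumption, F(3, 707) would correspond to four groups and 711 observations, since the degrees of freedom are k - 1 and N - k.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical z-scores of one measure under three levels of suspicion.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The p-value is the upper tail of the F distribution at the computed statistic; with SciPy installed it can be obtained as `scipy.stats.f.sf(f_stat, df_b, df_w)` and then compared with the chosen α of 0.10.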

To understand how the data behaved over time, graphical analysis
was performed and revealed promising results. The average
magnitude for cues in each construct was graphed on a bar chart
for side-by-side comparison. One example is given here: the
Quantity construct clearly increases as Suspicion increases from
Truth to Deception (Figure 3). However, the differences within the
degrees of evidence are not clearly increasing. This may not be of
concern, with the exception of the w/o Evidence deception scores.
One explanation for this may be that the w/o Evidence level of
suspicion had a very small sample size.
Figure 3, Quantity


In addition to bar graphs to examine magnitude, trend lines were
drawn to examine general behavior over time. Figure 4 shows one
example, again of the construct Quantity and its measures. It
shows how Quantity decreases almost uniformly regardless of level
of suspicion. One explanation for this pattern could be fatigue
[42]. After four hours the subject could just be tired of talking.
However, there is a stark difference in the Quantity spoken when
comparing truthful vs deceptive speech, and this difference stays
relatively constant, a pattern in and of itself.


Figure  4,  Quantity  Line  Graph 
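
A trend line of the kind described above can be obtained with an ordinary least-squares fit. The sketch below assumes a straight-line trend, which the text does not specify, and uses hypothetical values rather than the study's data; a negative slope corresponds to Quantity decreasing over the course of the interview.

```python
def fit_trend_line(times, values):
    """Least-squares fit of values = slope * time + intercept."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    # Sum of squared deviations in time, and time/value co-deviation.
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return slope, intercept


# Hypothetical construct averages at four successive points in time.
slope, intercept = fit_trend_line([0, 1, 2, 3], [10, 8, 6, 4])
```

Fitting truthful and deceptive responses separately and comparing the two intercepts is one way to express a gap that stays roughly constant while both lines decline.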

Figure 5 shows a comparison of truths (blue) to lies (red) over
time for the three measures of the Quantity construct. Because
these measures are highly correlated, they behave similarly over
time.
Figure 5, Quantity Trendlines


V.  Conclusion 

Based on the above, it is reasonable to state that the research
question was supported and that speech cues to deceptive behavior
are impacted by the receiver's suspicion during dyadic interactions
in real-world high-stakes settings. This validated IDT in a RWHS
setting. This study also looked at whether measurements and
constructs developed by previous researchers could hold up under a
RWHS case. The ANOVA found 31.7% of the 41 behavioral cues measured
to be significant at the Topic level. Regression also showed a
strong relationship between the levels of suspicion and the
individual measures, with 31.7% significant at the Topic level of
granularity. Given the poor quality of the audio data, this is
strong support for utilizing these measures in future deception
detection research.

Several measures and constructs, utilized and validated in
existing research, were explored and validated in this study.
However, many of the measures and their constructs were not
significant predictors of deceptive behavior or explained only a
fraction of the variance. Their poor predictive power could be
explained by the fact that the study was a single case and that
all measurements were taken from an uncontrolled environment.
However, this fact does add weight to those measures and
constructs that were significant predictors of deceptive behavior.

With regard to IDT, one contribution of this study is a better
understanding of the impact suspicion has in a RWHS setting. The
length of the interaction in this case study was also a good
opportunity to examine IDT and how a lengthy dyadic communication
can be dissected into reasonable units of analysis. IDT was
validated to the extent that suspicion plays a role in the
sender's behavior and that it affected cue intensity. It is
apparent that not only does suspicion play a central role in IDT
but that its impact on deceptive speech behaviors is measurable in
a RWHS environment. This point is important to unlocking future
studies involving IDT, suspicion, and RWHS cases.

Several limitations are common to any case study. In the current
study, an emphasis was placed on limiting the research to a RWHS
environment, which raises a number of questions. Was this a
typical high-stakes interview? Mr. Perry was more than a suspect:
he knew the FBI had evidence against him, but he did not know how
much evidence the lead detective had. Before the interview he
wanted to plead down to six years for trafficking in child
pornography; after the interview he received life in prison, 470
years to be exact. One could argue that, having been caught, even
on one criminal charge, he did not think he had much to lose by
his deception.

The nature and environment of this real-world case is another
limitation and a potential area for further study. The longer
dyadic communication typical of law enforcement interviews,
combined with a lack of fine granularity of episodes, suggests the
need for further research in interview-style communications. The
difficulty is two-fold. First, longer interviews will be more
difficult to gather in a controlled manner, simply because
volunteers are not going to sit for hours without proper
compensation. Second, the free-flowing nature of longer
communications makes controlling the study more complex.

The exploratory nature of the study, the volume of data, and the
numerous methods of analysis used generated many possibilities for
future research. One aspect of IDT which should be examined in
greater detail is the view that deception involves strategic and
non-strategic behaviors. This study's initial view into a RWHS
deceptive case did not look for strategic motives. However, such
an examination could produce new, insightful knowledge about
deception, specifically in the case of longer, more realistic
dyadic interactions. This study kept IDT at the forefront when
choosing the research question. However, there are several
theories on deceptive behavior, all of which could benefit from
being examined through a RWHS case study. One final potential
future research area is the development of a collection of RWHS
deception case studies. If a database of RWHS cases in which
ground truth is established could be collected, it would be
invaluable to the field of deception research.